Approximation and Streaming Algorithms for Projective Clustering via Random Projections
Abstract
Let $P$ be a set of $n$ points in $\mathbb{R}^d$. In the projective clustering problem, given $k$, $q$ and norm $\rho \in [1,\infty]$, we have to compute a set $F$ of $k$ $q$-dimensional flats such that $\left(\sum_{p \in P} d(p,F)^{\rho}\right)^{1/\rho}$ is minimized; here $d(p,F)$ represents the (Euclidean) distance of $p$ to the closest flat in $F$. We let $f_k(P,\rho)$ denote the minimal value and interpret $f_k(P,\infty)$ to be $\max_{r \in P} d(r,F)$. When $\rho = 1$, $2$ and $\infty$ and $q = 0$, the problem corresponds to the $k$-median, $k$-means and $k$-center clustering problems respectively. For every $0 < \varepsilon < 1$, $S \subset P$ and $\rho \ge 1$, we show that the orthogonal projection of $P$ onto a randomly chosen flat of dimension $O(((q + 1)\log(1/\varepsilon)/\varepsilon)\log n)$ will $\varepsilon$-approximate $f_1(S,\rho)$. This result combines the concepts of geometric coresets and subspace embeddings based on the Johnson-Lindenstrauss Lemma. As a consequence, an orthogonal projection of $P$ to an $O(((q + 1)\log((q + 1)/\varepsilon)/\varepsilon)\log n)$-dimensional randomly chosen subspace $\varepsilon$-approximates projective clusterings for every $k$ and $\rho$ simultaneously. Note that the dimension of this subspace is independent of the number of clusters $k$. Using this dimension reduction result, we obtain new approximation and streaming algorithms for projective clustering problems. For example, given a stream of $n$ points, we show how to compute an $\varepsilon$-approximate projective clustering for every $k$ and $\rho$ simultaneously using only $O((n + d)((q + 1)\log((q + 1)/\varepsilon))/\varepsilon \log n)$ space. Compared to standard streaming algorithms with an $\Omega(kd)$ space requirement, our approach is a significant improvement when the number of input points and their dimension are of the same order of magnitude.
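To make the objective and the projection step in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The function names (`dist_to_flat`, `clustering_cost`, `random_projection`) are hypothetical, flats are assumed to be given as an origin point plus an orthonormal basis, and the rescaling by sqrt(d/m) is the usual Johnson-Lindenstrauss normalization, which is an assumption here because the abstract does not state it.

```python
import math
import random


def dist_to_flat(p, origin, basis):
    """Euclidean distance from p to the flat origin + span(basis).

    `basis` is assumed to be a list of q orthonormal vectors;
    q = 0 (empty list) makes the flat a single point.
    """
    v = [pi - oi for pi, oi in zip(p, origin)]
    sq = sum(x * x for x in v)
    for b in basis:
        c = sum(x * y for x, y in zip(v, b))
        sq -= c * c  # Pythagoras: remove the component inside the flat
    return math.sqrt(max(sq, 0.0))


def clustering_cost(points, flats, rho):
    """(sum_p d(p, F)^rho)^(1/rho), or max_p d(p, F) when rho is infinite."""
    dists = [min(dist_to_flat(p, o, b) for o, b in flats) for p in points]
    if math.isinf(rho):
        return max(dists)
    return sum(x ** rho for x in dists) ** (1.0 / rho)


def random_projection(points, m, seed=0):
    """Coordinates of each point in a random m-dimensional subspace.

    The subspace is spanned by Gaussian vectors orthonormalized with
    Gram-Schmidt; the sqrt(d / m) rescaling is an assumed normalization.
    """
    d = len(points[0])
    if m > d:
        raise ValueError("target dimension m must not exceed d")
    rng = random.Random(seed)
    basis = []
    while len(basis) < m:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:  # subtract components along earlier basis vectors
            c = sum(x * y for x, y in zip(g, b))
            g = [x - c * y for x, y in zip(g, b)]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 1e-12:  # skip (numerically) dependent draws
            basis.append([x / norm for x in g])
    scale = math.sqrt(d / m)
    return [[scale * sum(x * y for x, y in zip(p, b)) for b in basis]
            for p in points]
```

With `q = 0` every flat is a point, so `clustering_cost` with `rho` set to 1, 2 or `math.inf` evaluates a k-median, k-means or k-center style objective for a fixed set of centers. Evaluating the cost on `random_projection(points, m)` instead of `points` is the kind of comparison the dimension reduction result is about; the sketch does not choose `m` according to the bound in the abstract.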
Related articles
Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity
Clustering is a common problem in the analysis of large data sets. Streaming algorithms, which make a single pass over the data set using small working memory and produce a clustering comparable in cost to the optimal offline solution, are especially useful. We develop the first streaming algorithms achieving a constant-factor approximation to the cluster radius for two variations of the k-cent...
Spectral Clustering via the Power Method - Provably
Spectral clustering is arguably one of the most important algorithms in data mining and machine intelligence; however, its computational complexity makes it a challenge to use it for large scale data analysis. Recently, several approximation algorithms for spectral clustering have been developed in order to alleviate the relevant costs, but theoretical results are lacking. In this paper, we pre...
Reconstruction of sparse signals from l1 dimensionality-reduced Cauchy random-projections
Dimensionality reduction via linear random projections is used in numerous applications including data streaming, information retrieval, data mining, and compressive sensing (CS). While CS has traditionally relied on normal random projections, corresponding to $\ell_2$ distance preservation, a large body of work has emerged for applications where $\ell_1$ approximate distances may be preferred. Dimensionali...
Analyzing graph structure via linear measurements
We initiate the study of graph sketching, i.e., algorithms that use a limited number of linear measurements of a graph to determine the properties of the graph. While a graph on $n$ nodes is essentially $O(n^2)$-dimensional, we show the existence of a distribution over random projections into $d$-dimensional "sketch" space ($d \ll n^2$) such that several relevant properties of the original graph can be inf...
Very Sparse Stable Random Projections, Estimators and Tail Bounds for Stable Random Projections
The method of stable random projections [39, 41] is popular for data streaming computations, data mining, and machine learning. For example, in data streaming, stable random projections offer a unified, efficient, and elegant methodology for approximating the $\ell_\alpha$ norm of a single data stream, or the $\ell_\alpha$ distance between a pair of streams, for any $0 < \alpha \le 2$. [18] and [20] applied stable random pro...
Journal title:
- CoRR
Volume abs/1407.2063 Issue
Pages -
Publication date 2015